By now a new buzzword is spreading across the Internet, perhaps replacing, or adding itself to, the old “Cloud Computing”.
Big Data.
The two terms, both subject to heavy media bombardment online, are also closely linked to each other, as we want to show below. And since we also want to show how the Big Data world is deeply tied to the Open Source world, see, for example, the link between Cloud Computing and Open Source in our old note.
Definition
Let’s start with some definitions of Big Data, to try to curb the usual sensationalist journalistic errors, or the deliberate distortions made for marketing purposes, that we have grown so accustomed to with Cloud Computing. Is there an official definition?
We quote Wikipedia, Gartner, IBM and Villanova University in Tampa (Florida), and further on NIST:
Gartner – taken from the glossary
IBM – and I also suggest their infographic on the 4 Vs
Everyone, then, seems to agree in defining Big Data as “a collection of data so large and complex that it requires tools different from traditional ones to be analyzed and visualized”. Then a few differences begin:
Everyone agrees that “the data potentially comes from heterogeneous sources”; here some argue that it is all “structured data”, while others also include “unstructured data”.
Now let’s come to the size this data must have to be called Big Data. Here there is obviously some disagreement, and the English Wikipedia rightly argues that the Big Data threshold is constantly moving; it could not be otherwise, considering the many studies that analyze the growth of data produced worldwide each year. In 2012 the talk was of a range from tens of terabytes to several petabytes per dataset, while now we are talking about zettabytes (billions of terabytes).
On this point, we refer to the provocative article by Marco Russo, sent to Luca De Biase and published on his blog.
Everyone agrees on the 3 Vs that characterize Big Data:
- volume: the ability to acquire, store and access large volumes of data;
- velocity: the ability to perform data analysis in real or near-real time;
- variety: the various types of data, coming from different sources.
And some speak of a 4th V:
- veracity: the quality of the data, understood as the informational value that can be extracted from it.
But what is NIST doing about the definition of Big Data? We know that NIST moves slowly and ponderously; we learned this from the months, indeed years, during which the definition of Cloud Computing sat permanently in draft, work on it having started back in 2008.
Well, NIST began to move when the U.S. government decided to allocate $200 million to the Big Data Initiative: the NIST Big Data Workshop was held and a Working Group open to everyone was started, as had been done for the definition and all the documents related to the term Cloud Computing.
Ecosystem
To show the size of the global ecosystem that revolves around this term, let’s look at three infographics from Bloomberg, Forbes and Capgemini respectively.

From these three infographics it is already evident how heavily Open Source solutions are used in the Big Data ecosystem; Forbes even lists only Open Source software among the technologies.
Size
Let’s take a look at the market and the growth around this Big Data ecosystem.
According to Gartner (2012 data), Big Data Will Drive $28 Billion of IT Spending, and Big Data Creates Big Jobs: 4.4 Million IT Jobs Globally to Support Big Data By 2015.
And now let’s enjoy these two infographics, one from Asigra and one from IBM, which is very active in the BigData world:
In short, the Big Data market basically requires a few things:
- Storage systems for large amounts of data, really large ones
- Large parallel computing capacity
- Qualified personnel (data analysts, data scientists) able to “sniff out” interesting results by analyzing large amounts of apparently unrelated data
- Software for continuous data acquisition, for data analysis and for visual data representation
Opportunity
Big Data, in my opinion, is a great opportunity for the large hardware and software IT companies (IBM, HP, EMC, Oracle, etc.), as it reawakens companies’ appetite for buying hardware rather than using the Public Cloud. There is also a growing need for simple, dedicated and customized software for data analysis. Of course, in many cases the data could be kept and processed at a Cloud Provider, and this is what market leaders such as AWS have allowed for some time now with DynamoDB, Redshift and Elastic MapReduce; but keeping petabytes or zettabytes (if these are the sizes we must reach to speak of Big Data) in the Cloud costs a lot, and I think it may well be more convenient to maintain your own infrastructure. It is different if we have a few terabytes of data on which we want to do data analysis, and I think this is the most common scenario, where the services of a public cloud like AWS become truly competitive.
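To make the idea concrete, here is a minimal sketch of how a transient Elastic MapReduce cluster can be launched from Python with the boto3 SDK (a client that post-dates this article); the region, instance types, counts and S3 paths are hypothetical placeholders, and the step simply runs a mapper/reducer pair like the word-count scripts shown later on.

```python
import boto3

# Hypothetical example: region, instance types and S3 paths are placeholders.
emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="nightly-analysis",
    ReleaseLabel="emr-6.10.0",
    Applications=[{"Name": "Hadoop"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        # Transient cluster: terminate as soon as the steps are done,
        # so we pay for computing power only while the analysis runs.
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    Steps=[{
        "Name": "wordcount",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "hadoop-streaming",
                "-files", "s3://my-bucket/scripts/mapper.py,s3://my-bucket/scripts/reducer.py",
                "-mapper", "mapper.py",
                "-reducer", "reducer.py",
                "-input", "s3://my-bucket/input/",
                "-output", "s3://my-bucket/output/",
            ],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)

print("Cluster started:", response["JobFlowId"])
```

Because the cluster is not kept alive once the step finishes, the bill covers only the duration of the analysis, which is exactly the pay-per-use argument in favour of the Public Cloud for the “few terabytes” scenario.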
Recently, the big IT companies have opened up many opportunities for companies, startups and the research world around Big Data: for example, EMC announced the Hadoop Starter Kit 2.0, Microsoft offers Hadoop in the Azure cloud, SAS allied itself with SAP on the HANA platform, SAP HANA is also available on demand in AWS, and Intel and AWS offer free trials. In short, there is something for everyone; it is a real boom for the IT economy.
Open Source and Cloud Computing
On Big Data and Cloud Computing in practice we have already given an answer: the possibilities are many. We have mentioned the undisputed leader (AWS) and Azure among the Public Cloud offerings, but Google also has useful tools (BigQuery); just remember Google’s famous and by now old BigTable, which is used for their search engine.
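As a small taste of BigQuery, here is a minimal sketch using the official google-cloud-bigquery Python client against one of Google’s public sample datasets; it assumes a Google Cloud project and credentials are already configured in the environment.

```python
from google.cloud import bigquery  # pip install google-cloud-bigquery

# The client picks up the project and credentials from the environment.
client = bigquery.Client()

# The aggregation runs entirely on Google's infrastructure; we only
# stream back the (small) result set.
query = """
    SELECT corpus, SUM(word_count) AS total_words
    FROM `bigquery-public-data.samples.shakespeare`
    GROUP BY corpus
    ORDER BY total_words DESC
    LIMIT 5
"""

for row in client.query(query).result():
    print(row.corpus, row.total_words)
```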
The Public Cloud, even in the case of Big Data, can be very useful and very democratic (if we do not insist on the dataset sizes that the definitions would require). Think of the simplicity of not having to manage storage systems, backups and disaster recovery, of not having to manage data-analysis software (if we use a PaaS or SaaS solution), of being able to keep very little capacity active during periods of non-analysis (paying little), and of being able to spin up computing power only for the duration of our queries.
Now we come to Big Data and Open Source. As we have seen, one name resonates strongly in all the scenarios mentioned so far: Hadoop.
Hadoop is an open-source software framework (Apache 2.0 license) for storing and processing large amounts of data on clusters of commodity hardware. It was started in 2005 by Doug Cutting and Mike Cafarella and, if I remember correctly, it was born as an open-source implementation of Google’s MapReduce and GFS papers, for a competing search-engine project (Nutch).
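To give a concrete feel for the MapReduce model at the heart of Hadoop, here is the classic word-count example written for Hadoop Streaming, which lets you implement the mapper and the reducer as plain scripts that read stdin and write stdout; a minimal sketch, with file names chosen for illustration.

```python
#!/usr/bin/env python
# mapper.py - reads text from stdin and emits "word<TAB>1" for every word
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word, 1))
```

```python
#!/usr/bin/env python
# reducer.py - Hadoop sorts the mapper output by key, so identical words
# arrive on consecutive lines and we can sum their counts as we stream.
import sys

current_word, current_count = None, 0

for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)

if current_word is not None:
    print("%s\t%d" % (current_word, current_count))
```

The job would then be launched with something like `hadoop jar hadoop-streaming.jar -input /data/books -output /data/wordcount -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py` (paths and jar location are placeholders), with Hadoop distributing both the data and the work across the cluster.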
Many distributed storage and processing solutions have grown out of this project; Hadoop has many sub-projects, such as:
- HDFS, a distributed file system.
- Cassandra, a scalable multi-master database with no single point of failure (used by Facebook).
- HBase, a distributed database for structured data, designed for very large tables (billions of rows and millions of columns).
- Hive, a data warehouse that allows easy querying and management of large datasets residing in distributed storage (see the query sketch after this list).
- Pig, a platform for analyzing large data sets built around a high-level language; Pig programs are naturally parallelizable, so large amounts of data can be analyzed in a short time.
- Mahout, a project to produce scalable machine-learning implementations, mainly focused on collaborative filtering, clustering and classification.
to mention only the best known in the Hadoop world.
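As an example of Hive in practice, here is a minimal sketch that queries a hypothetical `events` table from Python through the PyHive library, assuming a HiveServer2 instance reachable on its default port 10000; Hive compiles the SQL-like statement into distributed jobs that run over the data sitting in HDFS.

```python
from pyhive import hive  # pip install 'pyhive[hive]'

# Host, database and table are hypothetical; 10000 is HiveServer2's default port.
conn = hive.connect(host="hive.example.com", port=10000, database="default")
cursor = conn.cursor()

# HiveQL looks like SQL, but Hive turns it into distributed jobs over HDFS data.
cursor.execute("""
    SELECT country, COUNT(*) AS n
    FROM events
    WHERE year = 2013
    GROUP BY country
    ORDER BY n DESC
    LIMIT 10
""")

for country, n in cursor.fetchall():
    print(country, n)
```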
But open source at the service of Big Data doesn’t stop there:
- InfiniDB, a column-oriented database for data warehousing with a MySQL interface, purpose-built for analytics, analytical queries, transactional support and bulk-load workloads; it also integrates with Hadoop.
- SciDB, an all-in-one advanced analytics and data management platform: highly scalable, designed for complex analysis with a data-versioning system, for both commercial and scientific needs. It is a software platform capable of running on a grid of commodity hardware or in the Cloud.
- OpenTSDB, a time-series database built on HBase and Hadoop: a distributed system for the acquisition and analysis of large sets of time-series data, such as scientific or meteorological measurements.
- RRDTool, worth mentioning even though it should not really be counted among Big Data tools, since it was not created to work with large amounts of data; it is, however, well suited to graphical representations, keeps its data series at a fixed size, and continuously recomputes statistics as new data arrives.
- MySQL, the database par excellence of the Open Source world and the most widely used in the world, is not left behind. Even if the current trend is a strong push toward NoSQL, MySQL has had features useful for big data for a very long time: just think of data partitioning to speed up queries, and its scalability is well known.
- Talend offers a series of top-notch products that you can download, building its business on support, consulting, training and certification.
Here you can find recent tutorials that you can check out for free.
- SciPy, a collection of Python-based libraries for math, science and engineering.
- gnuplot, a powerful command-line tool available for a variety of operating systems, for turning data into rich and beautiful graphs.
- R, a software environment for statistical computing and graphics; around it we also find the interesting rOpenSci packages.
- MongoDB, the best known of the NoSQL databases, highly scalable and with Map/Reduce features, i.e. a programming model for processing large data sets in parallel, with the work distributed across a cluster (see the sketch after this list).
- CouchDB, which can be considered a competitor to MongoDB.
- Neo4j, a scalable, robust and fully ACID graph database.
- Presto, an open-source distributed SQL query engine for running interactive analytical queries on data sources of all sizes, from gigabytes to petabytes.
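To show what MongoDB’s Map/Reduce looks like in practice, here is a minimal sketch with the pymongo driver (assuming pymongo 3.x, where the map_reduce helper is still available); the database, collection and field names are hypothetical.

```python
from pymongo import MongoClient
from bson.code import Code

client = MongoClient("mongodb://localhost:27017")
events = client.demo.events  # hypothetical database and collection

# Map phase: emit (country, 1) for every document.
mapper = Code("function () { emit(this.country, 1); }")

# Reduce phase: sum all the values emitted for the same key.
reducer = Code("function (key, values) { return Array.sum(values); }")

# MongoDB runs both phases server-side and writes the result
# into the "events_by_country" collection.
result = events.map_reduce(mapper, reducer, "events_by_country")

for doc in result.find().sort("value", -1):
    print(doc["_id"], doc["value"])
```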
We’ll stop here for now, but we’ll keep updating the article.